 expression data





Supervised Graph Contrastive Learning for Gene Regulatory Networks

Oshima, Sho, Okamoto, Yuji, Tosaki, Taisei, Kojima, Ryosuke, Okuno, Yasushi

arXiv.org Artificial Intelligence

Graph Contrastive Learning (GCL) is a powerful self-supervised learning framework that performs data augmentation through graph perturbations, with growing applications in the analysis of biological networks such as Gene Regulatory Networks (GRNs). The artificial perturbations commonly used in GCL, such as node dropping, induce structural changes that can diverge from biological reality. This concern has contributed to a broader trend in graph representation learning toward augmentation-free methods, which view such structural changes as problematic and to be avoided. However, this trend overlooks the fundamental insight that structural changes from biologically meaningful perturbations are not a problem to be avoided but a rich source of information, thereby ignoring the valuable opportunity to leverage data from real biological experiments. Motivated by this insight, we propose SupGCL (Supervised Graph Contrastive Learning), a new GCL method for GRNs that directly incorporates biological perturbations from gene knockdown experiments as supervision. SupGCL is a probabilistic formulation that continuously generalizes conventional GCL, linking artificial augmentations with real perturbations measured in knockdown experiments and using the latter as explicit supervisory signals. To assess effectiveness, we train GRN representations with SupGCL and evaluate their performance on downstream tasks. The evaluation includes both node-level tasks, such as gene function classification, and graph-level tasks on patient-specific GRNs, such as patient survival hazard prediction. Across 13 tasks built from GRN datasets derived from patients with three cancer types, SupGCL consistently outperforms state-of-the-art baselines.

Graph representation learning has recently attracted attention in various fields to learn a meaningful latent space to represent the connectivity and attributes in given graphs (Ju et al., 2024). The application of graph representation learning to Gene Regulatory Networks (GRNs), which contain information about intracellular functions and processes, is particularly important in the fields of biology and drug discovery. It is expected to contribute to the identification of therapeutic targets and the elucidation of disease mechanisms. Representation learning for GRNs has been applied to tasks such as transcription factor inference (Yu et al., 2025) and predicting drug responses in cancer cell lines (Liu et al., 2022). Advances in gene expression measurement and analysis technologies have enabled the construction of patient-specific GRNs, highlighting gene regulation patterns that differ from the population as a whole (Nakazawa et al., 2021). Hereafter, this paper will refer to such individualized networks simply as GRNs.
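
For readers unfamiliar with the contrastive objective that conventional GCL builds on, the following is a minimal sketch of a generic InfoNCE loss for a single anchor embedding. It is not the SupGCL formulation, which the abstract describes only at a high level. The function names, the temperature value, and the toy two-dimensional embeddings are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.5):
    """Generic InfoNCE loss: negative log-softmax score of the positive pair."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]


# Toy embeddings (made up): the anchor and a view that should match it.
anchor = [1.0, 0.0]
aligned = info_nce(anchor, [0.9, 0.1], [[-1.0, 0.0], [0.0, -1.0]])
misaligned = info_nce(anchor, [-1.0, 0.0], [[0.9, 0.1], [0.0, -1.0]])
print(aligned < misaligned)
```

The loss is small when the positive view is closer to the anchor than the negatives are, and grows when the positive is pushed away. In standard GCL the positive view would come from an artificial augmentation such as node dropping.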


Enhanced Single-Cell RNA-seq Embedding through Gene Expression and Data-Driven Gene-Gene Interaction Integration

Goudarzi, Hojjat Torabi, Pouyan, Maziyar Baran

arXiv.org Artificial Intelligence

Single-cell RNA sequencing (scRNA-seq) provides unprecedented insights into cellular heterogeneity, enabling detailed analysis of complex biological systems at single-cell resolution. However, the high dimensionality and technical noise inherent in scRNA-seq data pose significant analytical challenges. While current embedding methods focus primarily on gene expression levels, they often overlook crucial gene-gene interactions that govern cellular identity and function. To address this limitation, we present a novel embedding approach that integrates both gene expression profiles and data-driven gene-gene interactions. Our method first constructs a Cell-Leaf Graph (CLG) using random forest models to capture regulatory relationships between genes, while simultaneously building a K-Nearest Neighbor Graph (KNNG) to represent expression similarities between cells. These graphs are then combined into an Enriched Cell-Leaf Graph (ECLG), which serves as input for a graph neural network to compute cell embeddings. By incorporating both expression levels and gene-gene interactions, our approach provides a more comprehensive representation of cellular states. Extensive evaluation across multiple datasets demonstrates that our method enhances the detection of rare cell populations and improves downstream analyses such as visualization, clustering, and trajectory inference. This integrated approach represents a significant advance in single-cell data analysis, offering a more complete framework for understanding cellular diversity and dynamics.
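
As an illustration of the K-Nearest Neighbor Graph step mentioned above, here is a minimal sketch that links each cell to its k nearest cells by Euclidean distance between expression profiles. The distance metric, the function name `knn_graph`, and the toy matrix are assumptions; the abstract does not specify these details, and the random-forest Cell-Leaf Graph is not shown.

```python
import math


def knn_graph(profiles, k):
    """Map each cell index to the indices of its k nearest cells."""
    graph = {}
    for i, p in enumerate(profiles):
        # Distance from cell i to every other cell, paired with that cell's index.
        dists = [(math.dist(p, q), j) for j, q in enumerate(profiles) if j != i]
        dists.sort()
        graph[i] = [j for _, j in dists[:k]]
    return graph


# Toy expression matrix: 4 cells x 3 genes (values are made up).
cells = [
    [5.0, 0.1, 0.0],
    [4.8, 0.0, 0.2],
    [0.1, 3.9, 4.1],
    [0.0, 4.0, 4.0],
]
neighbors = knn_graph(cells, k=1)
print(neighbors)
```

With these toy values, cells 0 and 1 pick each other as nearest neighbors, as do cells 2 and 3, reflecting the two expression patterns in the matrix.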



Biclustering Using Message Passing

Luke O'Connor, Soheil Feizi

Neural Information Processing Systems

Biclustering is the analog of clustering on a bipartite graph. Existing methods infer biclusters through local search strategies that find one cluster at a time; a common technique is to update the row memberships based on the current column memberships, and vice versa. We propose a biclustering algorithm that maximizes a global objective function using message passing. Our objective function closely approximates a general likelihood function, separating a cluster size penalty term into row- and column-count penalties. Because we use a global optimization framework, our approach excels at resolving the overlaps between biclusters, which are important features of biclusters in practice. Moreover, Expectation-Maximization can be used to learn the model parameters if they are unknown. In simulations, we find that our method outperforms two of the best existing biclustering algorithms, ISA and LAS, when the planted clusters overlap. Applied to three gene expression datasets, our method finds coregulated gene clusters that have high quality in terms of cluster size and density.
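
The local search technique the abstract contrasts itself with (updating row memberships from the current column memberships, and vice versa) can be sketched as follows. This is a simplified illustration of that baseline idea only, not the proposed message-passing algorithm and not an implementation of ISA or LAS. The mean-above-threshold membership rule, the function name, and the toy binary matrix are assumptions.

```python
def alternating_bicluster(matrix, seed_cols, threshold=0.5, max_iters=20):
    """Find one bicluster by alternating row and column membership updates."""
    rows, cols = set(), set(seed_cols)
    n_cols = len(matrix[0])
    for _ in range(max_iters):
        # Rows join if their mean over the current columns exceeds the threshold.
        new_rows = {
            i for i, row in enumerate(matrix)
            if sum(row[j] for j in cols) / len(cols) > threshold
        }
        if not new_rows:
            break
        # Columns join if their mean over the updated rows exceeds the threshold.
        new_cols = {
            j for j in range(n_cols)
            if sum(matrix[i][j] for i in new_rows) / len(new_rows) > threshold
        }
        if not new_cols or (new_rows == rows and new_cols == cols):
            break
        rows, cols = new_rows, new_cols
    return sorted(rows), sorted(cols)


# Toy binary matrix (made up) with a noisy block in rows 0-2, columns 0-2.
data = [
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1],
]
found_rows, found_cols = alternating_bicluster(data, seed_cols=[0])
print(found_rows, found_cols)
```

Because each call recovers a single cluster from a single seed, this style of search has no direct way to reason about overlapping biclusters, which is the limitation the global objective is meant to address.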


Brain-wide interpolation and conditioning of gene expression in the human brain using Implicit Neural Representations

Yu, Xizheng, Torok, Justin, Pandya, Sneha, Pal, Sourav, Singh, Vikas, Raj, Ashish

arXiv.org Artificial Intelligence

In this paper, we study the efficacy and utility of recent advances in non-local, non-linear image interpolation and extrapolation algorithms, specifically, ideas based on Implicit Neural Representations (INR), as a tool for analysis of spatial transcriptomics data. We seek to utilize the microarray gene expression data sparsely sampled in the healthy human brain, and produce fully resolved spatial maps of any given gene across the whole brain at a voxel-level resolution. To do so, we first obtained the 100 top AD risk genes, whose baseline spatial transcriptional profiles were obtained from the Allen Human Brain Atlas (AHBA). We adapted Implicit Neural Representation models so that the pipeline can produce robust voxel-resolution quantitative maps of all genes. We present a variety of experiments using interpolations obtained from Abagen as a baseline/reference.


HR-VILAGE-3K3M: A Human Respiratory Viral Immunization Longitudinal Gene Expression Dataset for Systems Immunity

Sun, Xuejun, Song, Yiran, Zhou, Xiaochen, Cai, Ruilie, Zhang, Yu, Li, Xinyi, Peng, Rui, Xie, Jialiu, Yan, Yuanyuan, Tang, Muyao, Lakshmanane, Prem, Zou, Baiming, Hagood, James S., Pickles, Raymond J., Li, Didong, Zou, Fei, Zheng, Xiaojing

arXiv.org Artificial Intelligence

Respiratory viral infections pose a global health burden, yet the cellular immune responses driving protection or pathology remain unclear. Natural infection cohorts often lack pre-exposure baseline data and structured temporal sampling. In contrast, inoculation and vaccination trials generate insightful longitudinal transcriptomic data. However, the scattering of these datasets across platforms, along with inconsistent metadata and preprocessing procedures, hinders AI-driven discovery. To address these challenges, we developed the Human Respiratory Viral Immunization LongitudinAl Gene Expression (HR-VILAGE-3K3M) repository: an AI-ready, rigorously curated dataset that integrates 14,136 RNA-seq profiles from 3,178 subjects across 66 studies encompassing over 2.56 million cells. Spanning vaccination, inoculation, and mixed exposures, the dataset includes microarray, bulk RNA-seq, and single-cell RNA-seq from whole blood, PBMCs, and nasal swabs, sourced from GEO, ImmPort, and ArrayExpress. We harmonized subject-level metadata, standardized outcome measures, applied unified preprocessing pipelines with rigorous quality control, and aligned all data to official gene symbols. To demonstrate the utility of HR-VILAGE-3K3M, we performed predictive modeling of vaccine responders and evaluated batch-effect correction methods. Beyond these initial demonstrations, it supports diverse systems immunology applications and benchmarking of feature selection and transfer learning algorithms. Its scale and heterogeneity also make it ideal for pretraining foundation models of the human immune response and for advancing multimodal learning frameworks. As the largest longitudinal transcriptomic resource for human respiratory viral immunization, it provides an accessible platform for reproducible AI-driven research, accelerating systems immunology and vaccine development against emerging viral threats.


Integrating Single-Cell Foundation Models with Graph Neural Networks for Drug Response Prediction

Rossner, Till, Li, Ziteng, Balke, Jonas, Salehfard, Nikoo, Seifert, Tom, Tang, Ming

arXiv.org Artificial Intelligence

AI-driven drug response prediction holds great promise for advancing personalized cancer treatment. However, the inherent heterogeneity of cancer and the high cost of data generation make accurate prediction challenging. In this study, we investigate whether incorporating the pretrained foundation model scGPT can enhance the performance of existing drug response prediction frameworks. Our approach builds on the DeepCDR framework, which encodes drug representations from graph structures and cell representations from multi-omics profiles. We adapt this framework by leveraging scGPT to generate enriched cell representations, using its pretrained knowledge to compensate for the limited amount of data. We evaluate our modified framework on IC$_{50}$ values using the Pearson correlation coefficient (PCC) and a leave-one-drug-out validation strategy, comparing it against the original DeepCDR framework and a prior scFoundation-based approach. scGPT not only outperforms previous approaches but also exhibits greater training stability, highlighting the value of leveraging scGPT-derived knowledge in this domain.
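
To make the evaluation protocol concrete, here is a minimal sketch of the Pearson correlation coefficient and of leave-one-drug-out splitting, in which every record for one drug is held out together. The record layout, the drug and cell-line names, and the IC50 values are hypothetical; the study's actual data pipeline is not described in the abstract.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def leave_one_drug_out(records):
    """Yield (held_out_drug, train, test); test holds all records of one drug."""
    for drug in sorted({r["drug"] for r in records}):
        train = [r for r in records if r["drug"] != drug]
        test = [r for r in records if r["drug"] == drug]
        yield drug, train, test


# Hypothetical drug-response records (names and values are made up).
records = [
    {"drug": "drug_A", "cell": "cell_1", "ic50": 1.2},
    {"drug": "drug_A", "cell": "cell_2", "ic50": 0.8},
    {"drug": "drug_B", "cell": "cell_1", "ic50": 3.1},
    {"drug": "drug_B", "cell": "cell_2", "ic50": 2.7},
    {"drug": "drug_C", "cell": "cell_1", "ic50": 0.4},
]
splits = list(leave_one_drug_out(records))
for drug, train, test in splits:
    print(drug, len(train), len(test))
```

Holding out whole drugs means the model is scored on compounds it never saw during training, which is a stricter test than a random split over drug and cell-line pairs.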